In [1]:
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt
Fundamentally, the DataFrame is just an abstraction, but it provides a ton of useful tools that you're going to get to see. This video goes over the basic idea of the DataFrame as well as how to create one.
In [4]:
import string
upcase = list(string.ascii_uppercase)
lcase = list(string.ascii_lowercase)
In [5]:
print(upcase[:5], lcase[:5])
You can create DataFrames by passing in NumPy arrays, lists of Series, or dictionaries.
In [6]:
pd.DataFrame([upcase, lcase])
Out[6]:
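As a minimal sketch of the three input types mentioned above, the cell below builds the same small table from a NumPy array, a list of Series, and a dictionary. The names `arr`, `from_array`, `from_series`, and `from_dict` are illustrative and not part of the original notebook.

```python
import numpy as np
import pandas as pd

# A 2x3 NumPy array: each row of the array becomes a DataFrame row.
arr = np.array([[1, 2, 3], [4, 5, 6]])
from_array = pd.DataFrame(arr)

# A list of Series: each Series becomes a row.
from_series = pd.DataFrame([pd.Series([1, 2, 3]), pd.Series([4, 5, 6])])

# A dictionary: each key becomes a column name, each value a column.
from_dict = pd.DataFrame({'a': [1, 4], 'b': [2, 5], 'c': [3, 6]})

print(from_array.shape, from_series.shape, from_dict.shape)
```

All three hold the same values; only the dictionary version carries named columns.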
We'll be covering a lot of different aspects here, but as always we're going to start with the simple stuff. A simplified way to think of a DataFrame is as an Excel table or SQL table: you've got columns and rows.
In more specific pandas terms, it's a more powerful list of series. Each column is a Series of data and it just so happens these can have relationships.
You can see that if we just pass in a list of lists, it treats each inner list as a row. Of course, if that's an issue, we can just transpose it and we'll get them as columns.
In [7]:
pd.DataFrame([upcase, lcase]).T
Out[7]:
This should be familiar because it’s the same way that we transpose ndarrays in numpy.
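Checking `.shape` before and after the transpose makes the orientation explicit. This is a small sketch; `wide` and `tall` are names introduced here for illustration.

```python
import string
import pandas as pd

upcase = list(string.ascii_uppercase)
lcase = list(string.ascii_lowercase)

wide = pd.DataFrame([upcase, lcase])  # each inner list is one row
tall = wide.T                         # transpose swaps rows and columns

print(wide.shape)  # (2, 26)
print(tall.shape)  # (26, 2)
```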
Of course, we can also specify them as explicit columns by passing in a dictionary where the keys are the column names and the values are the lists of items in each column (one item per row).
In [8]:
letters = pd.DataFrame({'lowercase':lcase, 'uppercase':upcase})
letters.head()
Out[8]:
Now you'll see that if these lengths are not the same, we'll get a ValueError, so it's worth checking that your data is clean before importing it or using it to create a DataFrame.
In [9]:
pd.DataFrame({'lowercase':lcase + [0], 'uppercase':upcase})
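If you would rather handle the mismatch than let the notebook stop, one option is to catch the exception. This is only a sketch; the `caught` variable is an illustrative name, and the exact error message depends on the pandas version.

```python
import string
import pandas as pd

upcase = list(string.ascii_uppercase)
lcase = list(string.ascii_lowercase)

caught = None
try:
    # 27 lowercase entries against 26 uppercase entries: lengths differ.
    pd.DataFrame({'lowercase': lcase + [0], 'uppercase': upcase})
except ValueError as err:
    caught = err
    print("ValueError:", err)
```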
In [10]:
letters.head()
Out[10]:
We can rename the columns easily and even add a new one through a relatively simple dictionary-like assignment. I'll go over some more complex methods later on.
In [11]:
letters.columns = ['LowerCase','UpperCase']
In [12]:
np.random.seed(25)
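# random_integers is no longer available in current NumPy; randint's upper bound is exclusive, so 51 keeps the range 1..50
letters['Number'] = np.random.randint(1, 51, 26)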
In [13]:
letters
Out[13]:
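The same two steps on a three-row frame, as a minimal sketch. The frame `df` and its values are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'lowercase': ['a', 'b', 'c'], 'uppercase': ['A', 'B', 'C']})

# Assigning to .columns replaces every label, in order.
df.columns = ['LowerCase', 'UpperCase']

# Dictionary-like assignment adds a column; a list must match the row count.
df['Number'] = [10, 20, 30]

print(df.columns.tolist())  # ['LowerCase', 'UpperCase', 'Number']
```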
Now, just like Series, DataFrames have data types. We can get those by accessing the `dtypes` attribute of the DataFrame, which lists the data type of each column.
In [14]:
letters.dtypes
Out[14]:
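A small sketch of what `dtypes` reports when columns hold different kinds of values. The frame below is illustrative; how string columns are labelled (for example `object`) depends on the pandas version, so the code only checks the numeric columns.

```python
import pandas as pd

df = pd.DataFrame({
    'LowerCase': ['a', 'b'],   # text column
    'Number': [1, 2],          # integer column
    'Ratio': [0.5, 1.5],       # floating-point column
})

# dtypes is itself a Series: column names as the index, types as the values.
print(df.dtypes)

print(pd.api.types.is_integer_dtype(df['Number']))  # True
print(pd.api.types.is_float_dtype(df['Ratio']))     # True
```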
In [15]:
letters.index = lcase
letters
Out[15]:
Of course, we can sort by a specific column or by the index.
In [16]:
letters.sort_values('Number')
Out[16]:
In [17]:
letters.sort_index()
Out[17]:
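In current pandas, sorting by a column is done with `sort_values` and sorting by the index with `sort_index`. The sketch below uses a made-up three-row frame so the two orderings are easy to tell apart; both calls return a new DataFrame and leave the original untouched.

```python
import pandas as pd

df = pd.DataFrame({'Number': [10, 30, 20]}, index=['c', 'a', 'b'])

by_value = df.sort_values('Number')  # ascending by the Number column
by_index = df.sort_index()           # ascending by the index labels

print(by_value.index.tolist())       # ['c', 'b', 'a']
print(by_index['Number'].tolist())   # [30, 20, 10]
print(df.index.tolist())             # ['c', 'a', 'b'] (original order kept)
```

Passing `ascending=False` to either method reverses the order.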
We've seen how to query for one column, and querying for multiple columns isn't much more difficult: pass a list of column names.
We can get the upper and lower case columns.
In [18]:
letters[['LowerCase','UpperCase']].head()
Out[18]:
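One detail worth seeing directly: a single column name returns a Series, while a list of names returns a DataFrame. A minimal sketch with an illustrative frame:

```python
import pandas as pd

df = pd.DataFrame({
    'LowerCase': ['a', 'b'],
    'UpperCase': ['A', 'B'],
    'Number': [1, 2],
})

one = df['Number']                       # single label -> Series
many = df[['LowerCase', 'UpperCase']]    # list of labels -> DataFrame

print(type(one).__name__, type(many).__name__)  # Series DataFrame
```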
We can also query the index. We went over a lot of that in the Series section, and a lot of the same applies here.
We can query by index location or by index label.
In [19]:
letters.iloc[5:10]
Out[19]:
In [20]:
letters.loc["f":"k"]
Out[20]:
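The two slices above differ in how they treat the end point: position slices exclude it, label slices include it. A small sketch, rebuilding a letter-indexed frame so the cell runs on its own:

```python
import string
import pandas as pd

lcase = list(string.ascii_lowercase)
df = pd.DataFrame({'UpperCase': list(string.ascii_uppercase)}, index=lcase)

by_position = df.iloc[5:10]    # positions 5 through 9; end excluded
by_label = df.loc['f':'k']     # labels 'f' through 'k'; end included

print(by_position.index.tolist())  # ['f', 'g', 'h', 'i', 'j']
print(by_label.index.tolist())     # ['f', 'g', 'h', 'i', 'j', 'k']
```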
That covers the basic concept of the pandas DataFrame.
We covered how indexes integrate with both Series and DataFrames. We've covered how NumPy underlies a lot of the power we've got, and to be honest we've really covered a lot of the fundamentals of doing data analysis with Python and pandas.
Although these videos have used fabricated data, we have covered a lot of the methods that you're going to be using on a regular basis during your analysis of data.
Let's go ahead and dive into our first data set.